Language identification of web documents using discrete HMMs

Authors

  • A. Xafopoulos
  • C. Kotropoulos
Abstract

Automatic language identification in written text documents is an issue which deserves significant attention in the context of the ever-growing volume of web documents. This paper deals with language identification in the domain of electronic texts related to tourism. The proposed system is built on Hidden Markov Models (HMMs) that enable the modeling of character sequences. For this purpose, a parallel structure of ergodic discrete HMMs is used. During testing, a previously unseen document is divided into its sentences, and each sentence is independently characterized in terms of the language it is written in. Experiments conducted on sentence-long documents demonstrated high identification rates.
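The scheme described in the abstract (one discrete HMM per language, evaluated in parallel on each sentence) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the two-symbol alphabet, the labels `lang_A` and `lang_B`, the hand-set (untrained) model parameters, and the regex-based sentence splitter. The sketch scores each sentence with the scaled forward algorithm and picks the model with the highest log-likelihood.

```python
import math
import re

ALPHABET = "ab"  # hypothetical toy alphabet; a real system would use a full character set


def forward_log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM.

    pi[i]   : initial probability of state i
    A[i][j] : transition probability from state i to state j
    B[i][k] : probability of emitting symbol index k in state i
    """
    if not obs:
        return 0.0
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_like = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        log_like += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_like


# Hand-set toy parameters (assumption: NOT trained, NOT from the paper).
# Both models are ergodic: every transition probability is non-zero.
TRANSITIONS = [[0.7, 0.3], [0.4, 0.6]]
MODELS = {
    "lang_A": ([0.5, 0.5], TRANSITIONS, [[0.9, 0.1], [0.6, 0.4]]),
    "lang_B": ([0.5, 0.5], TRANSITIONS, [[0.1, 0.9], [0.4, 0.6]]),
}


def encode(sentence):
    """Map characters to symbol indices, dropping characters outside the alphabet."""
    return [ALPHABET.index(ch) for ch in sentence if ch in ALPHABET]


def identify_sentence(sentence, models=MODELS):
    """Return the label of the model with the highest log-likelihood, or None."""
    obs = encode(sentence)
    if not obs:
        return None
    return max(models, key=lambda name: forward_log_likelihood(obs, *models[name]))


def identify_document(document, models=MODELS):
    """Split a document into sentences and label each one independently."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]
    return [(s, identify_sentence(s, models)) for s in sentences]


if __name__ == "__main__":
    for sentence, label in identify_document("aaaa. bbbb!"):
        print(sentence, "->", label)
```

In a real setting, the emission and transition tables would be estimated from training text in each language (for example with Baum-Welch), and the alphabet would cover the characters of all candidate languages; the abstract does not specify these details.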


Similar resources

Automated analysis of dynamic web services

Web applications appear as mazes to a user. Using a web browser, the user explores each web page without seeing the structure of the entire service. For a software tester, it would be convenient to have a map, in form of a graph, describing the functional topology of the service. In that way, it would be possible to analyse the possible paths which can be navigated to discover redundancies and ...


SECURING INTERPRETABILITY OF FUZZY MODELS FOR MODELING NONLINEAR MIMO SYSTEMS USING A HYBRID OF EVOLUTIONARY ALGORITHMS

In this study, a Multi-Objective Genetic Algorithm (MOGA) is utilized to extract interpretable and compact fuzzy rule bases for modeling nonlinear Multi-input Multi-output (MIMO) systems. In the process of nonlinear system identification, structure selection, parameter estimation, model performance and model validation are important objectives. Furthermore, securing low-level and high-level ...


Identifying Topics for Web Documents through Fuzzy Association Learning

Due to the explosive growth of available information on the World Wide Web (WWW), users have suffered from the information overload. To alleviate this problem, there is a need for an intelligent tool to help the users screening and filtering for interesting and useful information. In this paper, a method of automatically identifying topics for Web documents via a classification technique is propose...


O M 4.1 The Subject Database 4.2 Experiment Plan 5.1 Varying the Overlap 4 Experimental Setup 5 Parameterisation Results

Stochastic modelling of non-stationary vector timeseries based on HMMs has been very successful for speech applications [5]. Recently it has been applied to a range of image recognition problems [7, 9]. Previously reported work [6] has investigated the use of HMMs to model human faces for identification purposes. Faces can be intuitively divided into regions such as the mouth, eyes, nose, etc., ...


Identification of Structural Dynamic Discrete Choice Models

This paper presents new identification results for the class of structural dynamic discrete choice models that are built upon the framework of the structural discrete Markov decision processes proposed by Rust (1994). We demonstrate how to semiparametrically identify the deep structural parameters of interest in the case where utility function of one choice in the model is parametric but the dis...




Publication date: 2002